Data Simulation

Simulating networks

We adopt a top-down approach to simulate hierarchical networks, considering various simulation parameters such as graph sparsity, noise, and the architecture of the super-level graph(s), including small-world, scale-free, and random graph networks (Watts and Strogatz 1998; Barabási and Bonabeau 2003).

Our simulations focus on basic hierarchies comprising one or two hierarchical layers. Two-layer networks mirror classical community detection on graphs, where our aim is to recover the true community labels from a given graph. Meanwhile, three-layer networks present a more intricate scenario, where the bottom layer of the hierarchy contains two levels of community structure. Here, the top level corresponds to the nodes at the uppermost layer of the hierarchy, and the middle level consists of communities nested within the top-level communities. The objective with these networks is to identify both sets of community partitions.

In each hierarchy, for fully connected networks, we initiate by simulating \(N_{\text{top}}\) top-level nodes, adhering to a directed small-world, random graph, or scale-free network architecture (Watts and Strogatz 1998; Barabási and Bonabeau 2003). In cases where the network is disconnected, we simply simulate \(N_{\text{top}}\) disconnected nodes. For networks with three hierarchical layers, we then generate a subnetwork of \(N_{\text{middle}}\) nodes from each top-layer node, adhering to the network structure utilized at the top level. If the network is fully connected, we apply a probability \(p_\text{between}\) to the nodes from different top-level communities being connected.

The final step in all hierarchies is to generate the nodes in the observed (bottom) layer of the hierarchy. For each top-layer or middle-layer node, we generate a sub-network of \(N_{\text{bottom}}\) nodes under the same sub-network structure as the previous layers, and we apply a probability \(p_\text{between}\) for nodes from different communities to share an edge.

Simulating gene expression

Once we simulate a hierarchical graph, we utilize this hierarchy to generate the node-feature matrix, which depicts the expression of \(N\) genes across \(d\) samples. Here, \(N\) denotes the number of nodes in the observed (bottom) layer of the hierarchy.

We simulate the node-feature matrix using the topological order of the observed level graph. We start by generating the features of nodes that have no parental input. We refer to these nodes as origin nodes. All origin nodes are simulated from a normal distribution with mean \(0\) and standard deviation \(\sigma\). All other nodes are simulated from a normal distribution centered at the mean of their parent nodes and with standard deviation \(\sigma\).

Hierarchical Commuity Detection (HCD) Overview

Our HCD method consists of two modules:

  1. A graph autoencoder based on the architecture proposed by (salehi2019graph?) which utilizes graph attention layers such as those first introduced by Veličković et al. (2017) (See most recent version of pseudocode for details). In our applications, we incorporate multi-head attention in all encoder and decoder layers to expand model learning capacity. The graph autoencoder module takes a set of node attributes \({\bf X}\in\mathbb{R}^{N \times d}\) and an adjacency matrix \({\bf A}_{ij}\in[0,1]\) defining the relationships between the node as input and learns a low dimensional embedding \({\bf Z} \in \mathbb{R}^{N \times q}\) of the network and attributes. This embedding is then used to reconstruct the node attributes and adjacency matrix under a separate loss function for each.

  2. The second module of HCD takes the embeddings \({\bf Z}\) generated by the autoencoder and applies a multilevel community detection process. This module first applies the function \(f_{top}\) which partitions the data into \(k\) groups.

Simulation Study

Here we describe the settings for three different sets of simulations. For each set of simulations, we simulated hierarchical gene networks consisting of 5 top level nodes/communities, 15 middle level nodes/communities and 300 nodes/genes at the observed level of the hierarchy. We simulate networks for all three graph types (small world, scale free or random graph) with the nodes at the top level of the hierarchy either all connected or all disconnected. We also simulated nodes under two levels of noise with noise set to either \(\sigma = 0.1\) or \(\sigma = 0.5\). Additionally, we sample the probability for edges within and between node communities at the middle layer and bottom level from the uniform distributions

Each combination of parameters (graph type, connectivity, and noise) is replicated multiple times.

Ablation Study

To evaluate the performance of HCD, we conducted a series of simulation studies under various conditions. Below, we describe each simulation condition in detail:

To further analyze HCD’s clustering performance, we explored five different :

  • No output layer (HCD-NOL) – The learned embedding is directly transformed to the output size \(K_{middle}/k_{top}\) and assigned to clusters via softmax.

the top layer is inferred via

\[ {\bf O} = {\bf Z}{\bf W}_{out} + \beta_{out} \]

\[{\bf P_{top}} = \text{Softmax}({\bf O}) \]

  • Simple linear layer (HCD-Linear) – The embedding is processed through a fully connected linear layer with LeakyReLU activation before soft assignment.

the top layer is inferred via

\[ {\bf M} = {\bf Z} {\bf W}_{1} + \beta_1 \]

\[{\bf M}_{norm} = \text{LayerNorm}({\bf M}) \]

\[ {\bf H} = \text{LeakyReLU}({\bf M}_{norm}) \]

\[ {\bf O} = {\bf H}{\bf W}_{out} + \beta_{out} \]

\[{\bf P_{top}} = \text{Softmax}({\bf O}) \]

  • SAGEConv layer (HCD-SAGE) – The embedding is processed through a GraphSAGE convolutional layer before converting it to logits and soft probabilities (Hamilton, Ying, and Leskovec 2017).

\[{\bf h}_v = \phi\left( {\bf W}_1 · CONCAT\big[{\bf z}_v,\ AGGREGATE\big({\bf z}_u : u \in \mathcal{N}(v) \big) \big]\right)\]

\[{\bf o}_v = {\bf W}_{out} · {\bf h}_v + \beta_{out}\]

\[{\bf p}_v = \text{Softmax}({\bf o}_v)\]

  • GATConv layer (HCD-GAT) – The embedding is processed through a graph attention layer before computing logits and soft probabilities (Veličković et al. 2017; Brody, Alon, and Yahav 2021).

\[{\bf h}_v = \phi\left(\sum_{u \in \mathcal{N}(v)} \alpha_{v,u} \cdot {\bf W} {\bf z}_u\right)\]

\[ {\bf o}_v = {\bf W}_{out} · {\bf h}_v + \beta_{out}\]

\[ {\bf p}_v = \text{Softmax}({\bf o}_v)\]

where

\[ \alpha_{v,u} = \text{Softmax}\left({\bf a}^T \text{LeakyReLU}\left({\bf W} \cdot [ {\bf z}_v || {\bf z}_u]\right)\right)\]

  • KMeans partitioning (HCD-Kmeans) – The top hierarchical layer is clustered using a deep learning-compatible KMeans algorithm from the torch_kmeans library. Each middle layer is then partitioned using the “No output layer” strategy (Falkner 2022).

We evaluated HCD under four output size determination strategies:

  • Ground truth-based output size – Fixed at 5 communities in the top layer and 15 in the middle layer.
  • Fixed output size of \([5, 64]\).
  • Bethe Hessian estimate – Using \(/hat{\kappa}\) for the middle layer and \(\frac{1}{2} \hat{\kappa}\) for the top layer.
  • Silhouette maximization – Selecting the output size based on silhouette score optimization.

These simulation conditions are designed to rigorously assess the performance of HCD under diverse scenarios. By varying the partitioning methods, output sizes, and use of prior information, we aim to gain insights into HCD’s robustness, adaptability, and effectiveness in uncovering hierarchical community structures in genomic data.

Evaluating performance

We evaluate the performance of our HCD method using three graph-based clustering metrics:

  1. homogeneity evaluates the degree to which each predicted community contains only data points from a single true community, indicating how well the algorithm avoids mixing different groups. Thus, homogeneity tends to be high if resolved communities contain only members of the same true community.

  2. completeness assesses the extent to which all data points that belong to the same true community are correctly assigned to a single predicted community. Thus completeness is always high if all members of the same true communities end up in the same resolved community even if several true communities are allocated together.

  3. NMI Normalized Mutual Information (NMI) is a weighted average of the Completeness and Homogeneity two metrics.

  4. ARI The Adjusted Rand Index (ARI) is a metric that measures the similarity between two different clusterings of the same data, correcting for the chance of random agreement. It ranges from -1 to 1, where 1 indicates perfect agreement between the clusterings, 0 represents random labeling, and negative values indicate less agreement than expected by chance.

In all simulations we compare HCD with two commonly used heuristic methods:

  • Louvain Method: This widely used community detection algorithm optimizes modularity by iteratively reassigning nodes between communities and merging communities. It automatically determines the number of communities and is highly efficient for processing large networks. In all simulations, we apply the Louvain method to the estimated adjacency matrix, which is derived from the correlation matrix of the node features. The resulting community predictions are compared to the ground truth for both the top and middle layers of the simulated hierarchy.

  • Hierarchical Clustering using Ward’s Distance (HC): This agglomerative clustering approach iteratively merges clusters to minimize the increase in within-cluster variance. Ward’s method tends to produce compact, spherical clusters and generates a dendrogram representing the full hierarchical structure of the data. In all simulations, HC is applied to the simulated gene expression data (node feature matrix). The optimal community predictions are determined by identifying the best cutting point on the dendrogram, aligning with the ground truth clusters (five clusters and 15 clusters).

Ways of Estimating K

There are several classical approaches to estimating the number of communities, \(k\), in unsupervised learning. Among these, two commonly adopted data-driven approaches are the Elbow Method and the optimization of a cluster quality metric, such as the Silhouette Score. Additionally, we also consider a graph-based approach known as the Bethe Hessian estimate.

  • Elbow Method:
    The Elbow Method identifies the optimal number of clusters by plotting the clustering cost (e.g., within-cluster sum of squares) against the number of clusters, \(k\). The “elbow” point, where the rate of decrease in the cost slows significantly, is taken as the optimal \(k\). This method assumes that beyond this point, adding more clusters does not improve clustering quality significantly.

  • Silhouette Method:
    The Silhouette Method evaluates the quality of clustering by measuring how similar data points are to their assigned cluster (cohesion) compared to other clusters (separation). The Silhouette score ranges from -1 to 1, where higher values indicate better-defined clusters. The optimal \(k\) is the value that maximizes the average Silhouette score across all data points (Rousseeuw 1987).

  • Bethe Hessian:
    The Bethe Hessian is a graph-based method for estimating the number of communities in a network. It involves constructing the Bethe Hessian matrix \({\bf B}_\eta = (\eta^2 - 1){\bf I} + {\bf D} - \eta {\bf A}\), where \(A\) is the adjacency matrix, \(D\) is the degree matrix, and \(\eta\) is a regularization parameter. The number of negative eigenvalues of \(H(r)\) provides an estimate of the number of communities. This approach leverages spectral properties of the graph to identify structural groupings. According to (Schaub, Li, and Peel 2023), the Bethe Hessian consistently estimates the finest possible detectable partition. Based on this, we propose using this estimate to determine the number of communities in the middle layer of the hierarchy and setting the number of partitions in the top layer to either half or one-third of this value.

Results

  • HCD demonstrates superior performance on the middle layer in fully connected networks.
  • HCD achieves performance comparable to Hierarchical Clustering on the top layer across all scenarios.
  • HCD-NOL performs comparably or better to more complex configurations like HCD-Linear, HCD-SAGE, or HCD-GAT suggesting the additional embedding processing step is unnecessary.
Clustering performance of HCD, Louvain, and Hierarchical Clustering for the middle hierarchcial layer across all simulated networks, where the number of communities is estimated using the Bethe Hessian estimate. "HC-truth" represents hierarchical clustering performance when the ground-truth partition of 15 communities is used, serving as a baseline for comparison. In this figure, HCD is evaluated using the No Output Layer scenario, where the learned embedding is directly assigned to clusters via softmax without additional processing.

Clustering performance of HCD, Louvain, and Hierarchical Clustering for the middle hierarchcial layer across all simulated networks, where the number of communities is estimated using the Bethe Hessian estimate. “HC-truth” represents hierarchical clustering performance when the ground-truth partition of 15 communities is used, serving as a baseline for comparison. In this figure, HCD is evaluated using the No Output Layer scenario, where the learned embedding is directly assigned to clusters via softmax without additional processing.

Clustering performance of HCD, Louvain, and Hierarchical Clustering for the top hierarchical layer across all simulated networks, where the number of communities is estimated using the Bethe Hessian estimate. "HC-truth" represents hierarchical clustering performance when the ground-truth partition of 15 communities is used, serving as a baseline for comparison. In this figure, HCD is evaluated using the No Output Layer scenario, where the learned embedding is directly assigned to clusters via softmax without additional processing.

Clustering performance of HCD, Louvain, and Hierarchical Clustering for the top hierarchical layer across all simulated networks, where the number of communities is estimated using the Bethe Hessian estimate. “HC-truth” represents hierarchical clustering performance when the ground-truth partition of 15 communities is used, serving as a baseline for comparison. In this figure, HCD is evaluated using the No Output Layer scenario, where the learned embedding is directly assigned to clusters via softmax without additional processing.

Clustering performance of HCD, Louvain, and Hierarchical Clustering for the middle hierarchcial layer across all simulated networks, where the number of communities is estimated using the Bethe Hessian estimates. "HC-truth" represents hierarchical clustering performance when the ground-truth partition of 15 communities is used, serving as a baseline for comparison. In this figure, HCD is evaluated using the No Output Layer scenario, where the learned embedding is directly assigned to clusters via softmax without additional processing.

Clustering performance of HCD, Louvain, and Hierarchical Clustering for the middle hierarchcial layer across all simulated networks, where the number of communities is estimated using the Bethe Hessian estimates. “HC-truth” represents hierarchical clustering performance when the ground-truth partition of 15 communities is used, serving as a baseline for comparison. In this figure, HCD is evaluated using the No Output Layer scenario, where the learned embedding is directly assigned to clusters via softmax without additional processing.

Clustering performance of HCD, Louvain, and Hierarchical Clustering for the top hierarchical layer across all simulated networks, where the number of communities is estimated using the Bethe Hessian estimates. "HC-truth" represents hierarchical clustering performance when the ground-truth partition of 15 communities is used, serving as a baseline for comparison. In this figure, HCD is evaluated using the No Output Layer scenario, where the learned embedding is directly assigned to clusters via softmax without additional processing.

Clustering performance of HCD, Louvain, and Hierarchical Clustering for the top hierarchical layer across all simulated networks, where the number of communities is estimated using the Bethe Hessian estimates. “HC-truth” represents hierarchical clustering performance when the ground-truth partition of 15 communities is used, serving as a baseline for comparison. In this figure, HCD is evaluated using the No Output Layer scenario, where the learned embedding is directly assigned to clusters via softmax without additional processing.

Clustering performance of HCD, Louvain, and Hierarchical Clustering for the middle hierarchcial layer across all simulated networks, where the number of communities is estimated using the Silouette estimates. "HC-truth" represents hierarchical clustering performance when the ground-truth partition of 15 communities is used, serving as a baseline for comparison. In this figure, HCD is evaluated using the No Output Layer scenario, where the learned embedding is directly assigned to clusters via softmax without additional processing.

Clustering performance of HCD, Louvain, and Hierarchical Clustering for the middle hierarchcial layer across all simulated networks, where the number of communities is estimated using the Silouette estimates. “HC-truth” represents hierarchical clustering performance when the ground-truth partition of 15 communities is used, serving as a baseline for comparison. In this figure, HCD is evaluated using the No Output Layer scenario, where the learned embedding is directly assigned to clusters via softmax without additional processing.

Clustering performance of HCD, Louvain, and Hierarchical Clustering for the top hierarchical layer across all simulated networks, where the number of communities is estimated using the Silouette estimates. "HC-truth" represents hierarchical clustering performance when the ground-truth partition of 15 communities is used, serving as a baseline for comparison. In this figure, HCD is evaluated using the No Output Layer scenario, where the learned embedding is directly assigned to clusters via softmax without additional processing.

Clustering performance of HCD, Louvain, and Hierarchical Clustering for the top hierarchical layer across all simulated networks, where the number of communities is estimated using the Silouette estimates. “HC-truth” represents hierarchical clustering performance when the ground-truth partition of 15 communities is used, serving as a baseline for comparison. In this figure, HCD is evaluated using the No Output Layer scenario, where the learned embedding is directly assigned to clusters via softmax without additional processing.

Clustering performance, measured by the Adjusted Rand Index (ARI), for all HCD variants, Hierarchical Clustering, and Louvain on fully connected networks, where the number of communities is estimated using the Bethe Hessian. This scenario leverages the spectral properties of the network to detect the finest possible partition, providing a principled estimate of the community structure. The results highlight how different clustering methods perform when the number of clusters is inferred rather than known a priori.

Clustering performance, measured by the Adjusted Rand Index (ARI), for all HCD variants, Hierarchical Clustering, and Louvain on fully connected networks, where the number of communities is estimated using the Bethe Hessian. This scenario leverages the spectral properties of the network to detect the finest possible partition, providing a principled estimate of the community structure. The results highlight how different clustering methods perform when the number of clusters is inferred rather than known a priori.

Clustering performance, measured by the Adjusted Rand Index (ARI), for all HCD variants, Hierarchical Clustering, and Louvain on disconnected networks, where the number of communities is estimated using the Bethe Hessian. This scenario leverages the spectral properties of the network to detect the finest possible partition, providing a principled estimate of the community structure. The results highlight how different clustering methods perform when the number of clusters is inferred rather than known a priori.

Clustering performance, measured by the Adjusted Rand Index (ARI), for all HCD variants, Hierarchical Clustering, and Louvain on disconnected networks, where the number of communities is estimated using the Bethe Hessian. This scenario leverages the spectral properties of the network to detect the finest possible partition, providing a principled estimate of the community structure. The results highlight how different clustering methods perform when the number of clusters is inferred rather than known a priori.

An example of an easy network

The following is an example of a network that is easy to infer for both HCD, HC, and the louvain method. This example is a small world disconnected network with standard deviation \(\sigma = 0.1\).

Clustering Performance for HCD using no output layer, Hierarchcial Clustering and the Louvain method on an easy to infer small world network. Results are shown when the output size is the ground truth and when it is inferred using bethe hessian or silouette techniques.

Clustering Performance for HCD using no output layer, Hierarchcial Clustering and the Louvain method on an easy to infer small world network. Results are shown when the output size is the ground truth and when it is inferred using bethe hessian or silouette techniques.

correlations on the embeddings and data
correlations on the embeddings and data

An example of a difficult network

The following is an example of a network that is difficult to infer. This example is a small world fully connected network with standard deviation \(\sigma = 0.5\).

Clustering Performance for HCD using no output layer, Hierarchcial Clustering and the Louvain method on a small world fully connected network that is difficult to infer. Results are shown when the output size is the ground truth and when it is inferred using bethe hessian or silouette techniques.

Clustering Performance for HCD using no output layer, Hierarchcial Clustering and the Louvain method on a small world fully connected network that is difficult to infer. Results are shown when the output size is the ground truth and when it is inferred using bethe hessian or silouette techniques.

correlations on the embeddings and data
correlations on the embeddings and data

References

Barabási, Albert-László, and Eric Bonabeau. 2003. “Scale-Free Networks.” Scientific American 288 (5): 60–69.
Brody, Shaked, Uri Alon, and Eran Yahav. 2021. “How Attentive Are Graph Attention Networks?” arXiv Preprint arXiv:2105.14491.
Falkner, J. K. 2022. “Torch-Kmeans.” GitHub Repository. https://github.com/jokofa/torch_kmeans; GitHub.
Hamilton, Will, Zhitao Ying, and Jure Leskovec. 2017. “Inductive Representation Learning on Large Graphs.” Advances in Neural Information Processing Systems 30.
Rousseeuw, Peter J. 1987. “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.” Journal of Computational and Applied Mathematics 20: 53–65.
Schaub, Michael T, Jiaze Li, and Leto Peel. 2023. “Hierarchical Community Structure in Networks.” Physical Review E 107 (5): 054305.
Veličković, Petar, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. “Graph Attention Networks.” arXiv Preprint arXiv:1710.10903.
Watts, Duncan J, and Steven H Strogatz. 1998. “Collective Dynamics of ‘Small-World’networks.” Nature 393 (6684): 440–42.